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Abstract 

The classification of high dimensional data with kernel methods is considered in this article. Exploit- 
ing the emptiness property of high dimensional spaces, a kernel based on the Mahalanobis distance is 
proposed. The computation of the Mahalanobis distance requires the inversion of a covariance matrix. 
In high dimensional spaces, the estimated covariance matrix is ill-conditioned and its inversion is unsta- 
ble or impossible. Using a parsimonious statistical model, namely the High Dimensional Discriminant 
Analysis model, the specific signal and noise subspaces are estimated for each considered class making 
the inverse of the class specific covariance matrix explicit and stable, leading to the definition of a par- 
simonious Mahalanobis kernel. A SVM based framework is used for selecting the hyperparameters of 
the parsimonious Mahalanobis kernel by optimizing the so-called radius-margin bound. Experimental 
results on three high dimensional data sets show that the proposed kernel is suitable for classifying high 
dimensional data, providing better classification accuracies than the conventional Gaussian kernel. 



1 Introduction 

High Dimensional (HD) data sets are commonly available for fully or partially automatic processing: For 
a relatively low number of samples, n, a huge number of variables, d, is simultaneously accessible. For 
instance, in hyperspectral imagery, hundreds of spectral wavelengths are recorded for a given pixel; in 
gene expression analysis, the measure of expression level of thousands of genes is typical; in customer 
recommendation systems for web services, to each potential client a high number of variables is associated 
(his past choices, his personal information ...) [TJ [2j [3]. For each sample, it is possible to have either 
numerical or alphabetical variables which can be sparse or with a different signal to noise ratio. 

In terms of processing, such data may need to be either classified, clustered, filtered or inversed in a 
supervised or unsupervised way. Although many algorithms exist in the literature for small or moderate 
dimensions (from Bayesian methods to Machine Learning techniques) most of them are not well suited to 
HD data. Actually, HD data pose critical theoretical and practical problems that need to be addressed 
specifically [2]. 

Indeed, HD spaces exhibit non intuitive geometrical and statistical properties when compared to lower 
dimensional spaces. Most of them do not behave in a similar way as in three dimensional Euclidean spaces 
(Table [I] summarizes the main properties of HD spaces) For instance, samples following a uniform law 
will have a tendency to have a high concentration in the corners [5] . The same property holds for normally 
distributed data: samples tend to have a high concentration in the tails [6], making density estimation a 
difficult task. This problem can be related to the number of parameters t to be estimated to fit a Gaussian 



distribution which grows quadratically with the space dimensionality, t = d(d + 3)/2 (5150 for d = 100). 
Because of this, conventional generative methods are not suitable for analyzing this type of data. 

Unfortunately, discriminative methods also suffer if the dimensionality is high, due to the " concentration 
of measure phenomenon" [5]. In HD spaces, samples tend to be equally distant from each other [7]. 
Hence, it is clear that Nearest Neighbors methods will definitively fail to process such data. Moreover, 
the Euclidean distance will not be appropriate to assess the similarity between two samples. In fact, it is 
has been shown that every Minkowski norm (||x|| m — ( l^li=i l^| m ) ^ ; ^ — 1,2...) is affected by this 
phenomenon [8]. Therefore, every method based on the distance between samples [9j (SVM with Gaussian 
kernel, neural network, Nearest Neighbors, Locally Linear Embedding. . . ) are potentially affected by this 
phenomenon |10[ [TT] . 

An additional property, for which the consequences are more practical than theoretical, is the "empty 
space phenomenon" |12| : In HD spaces, the available samples usually fill a very small part of the space. 
Therefore, most of the space is empty. Note that if originally the empty space phenomenon was considered 
as a problem, it will be seen in the following that it is actually the basis of several useful statistical models. 

Today, the phrasing 11 curse of dimensionality", originally from R. Bellman [12] . refers to the aforemen- 
tioned problems of HD data and reflects how processing HD data is difficult. However, as D. Donoho has 
noticed [2j, there is also a 11 Blessing of dimensionality": For instance in classification, the class separability 
is improved when the dimensionality of the data increases. Consider for example a comparison between 
hyperspectral (hundreds of spectral wavelengths) and multispectral (tens of spectral wavelengths) remote 
sensing images[13j. The former contains much more information, and enables a more accurate distinction 
of the land cover classes. However, if conventional methods are used, the additional information con- 
tained in hyperspectral images will not lead to an increase of the classification accuracy [5]. Hence, using 
conventional methods, classification accuracies remains low. 

Several methods have been proposed in the literature to deal with HD data for the purpose of classi- 
fication. A highly used strategy is Dimension Reduction (DR). DR aims at reducing the dimensionality 
of data by mapping them onto another space of a lower dimension, without discarding any, or as less as 
possible, of the meaningful information. Recent overviews of DR can be found in |14[ [T5| I16j . Two main 
approaches can be defined. 1) Unsupervised DR: The algorithms are applied directly on the data without 
exploiting any prior information, and project the data into a lower dimensional space, according to some 
criterion (data variance maximization for PCA, independence for ICA . . . ). 2) Supervised DR: Training 
samples are available and are exploited to find a lower dimensional subspace where the class separability is 
improved. Fisher Discriminant Analysis (FDA) is surely one of the most famous supervised DR method. 
However, FDA maximizes the ratio of the "between classes" scatter matrix, Sb, and the "within classes" 
scatter matrix, S w . The optimal solution is given by the eigenvector corresponding to the first eigenvalues 
of S^j Sb- In HD, S^ 1 is in general ill-conditioned which limits the effectiveness of the method. Other 
popular DR methods such as Laplacian eigenmaps, Isomap or Locally Linear Embedding |15[ Chapter 4 
and 5] may be also limited by the dimensionality since they are based on the Euclidean distance between 
the samples. One last drawback of DR methods is the risk of losing relevant information. In general, DR 
methods act globally, which can be a problem for classification purpose: Different classes may be mapped 
onto the same subspace, even if the global discrimination criteria is maximized. 

An alternative strategy to DR has been recently proposed, i.e., the subspace models [T7]. These models 
assume that each class is located in a specific subspace and consider the original space without DR for 
the processing. For instance, the Probabilistic Principal Component Analysis (PPCA) |18j assumes that 
the classes are normally distributed in a lower dimensional subspace and are linearly embedded in the 
original subspace with additive white noise. Such models exploit the empty space property of HD data 
without discarding any dimension of the data |T9, 20j. A general subspace model that encompasses several 
other models is the High Dimensional Discriminant Analysis (HDDA) model, proposed by Bouveyron et 
al. (2D [22]. 

Conversely, kernel based methods do not reduce the dimensionality but rather work with the full 
HD data [23] . These discriminative methods are known to be more robust to size of the dimensionality 
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Table 1: Summary of HD spaces properties. 



High Dimensional Spaces 



Curse Blessing 



Poor statistical estimation Emptiness 
Concentration of measure Class separability 



than conventional generative methods. However, local kernel methods are sensitive to the size of the 
dimensionality |24| . A kernel method is said to be local if the decision function value for a new sample 
depends on the neighbors of that sample in the training set. Since in HD data the neighborhood of a 
sample is mostly empty, such local methods are negatively impacted by the dimension. For instance, SVM 
with Gaussian kernel 



is such a local kernel method. 

In this paper, it is proposed to use subspace models to construct a kernel adapted to high dimen- 
sional data. The chosen approach for including subspace models in a kernel function is to consider the 
Mahalanobis distance, ds c , between two samples for a given class, c, with covariance matrix, S c : 



Previous works on the Mahalanobis kernel [25, 26 , 271 [28] were limited by the effect of dimensionality on the 
matrix inversion. In [25], the covariance matrix was computed on the whole training set. The associated 
implicit model is that the classes share the same covariance matrix, which is not true in practice. Diagonal 
and full covariance matrices were investigated in |26| for the purpose of classification and in |27j for the 
purpose of regression. However, in a similar way, the covariance matrix was computed for all the training 
samples. Computing the covariance matrix for the Mahalanobis distance with all the training samples is 
equivalent to project the data on all the principal components, scale the variance to one, and then applying 
the Euclidean distance. By doing so, classes could overlap more than in the original input space and the 
discrimination between them would be decreased. 

In this work, the HDDA model is used for the definition of a class specific covariance matrix adapted 
for HD data. The specific signal and noise subspaces are estimated for each considered class, ensuring 
a parsimonious characterization of the classes. Following the HDDA model it is then possible to derive 
an explicit formulation of the inverse of the covariance matrix, without any regularization or dimension 
reduction. The parsimonious Mahalanobis kernel is constructed by substituting the Euclidean distance 
with the Mahalanobis distance computed using the HDDA model. It is proposed in this work to define 
several hyperparameters in the kernel to control the influence of the signal and noise subspaces in the 
classification process. These hyperparameters are optimized during the training process by the minimiza- 
tion of the so-called radius margin bound of the SVM classifier. Compared to the previous works on 
the Mahalanobis kernel for HD data, the proposed method allows the use of a more complex model, a 
separate covariance matrix per class, with higher efficiency in terms of accuracy. The remainder of the 
paper is organized as follows. The subspace model and the proposed kernel are discussed in Section [2] 
The problem of selecting the hyperparameters for classification with SVM is addressed in Section [3] The 
Section [4] details the estimation of the size of the signal subspace. Results on simulated and real high 
dimensional data are reported in Section [5] Conclusions and perspectives conclude the paper. 
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2 Regularized Mahalanobis Kernel 



2.1 Review of HDDA model 

The most general HDDA sub-model is used in this worlQ i.e., each class has his own specific subspace. 
Here, we will review the HDDA model but restricted to the problem of the covariance matrix inversion. 
However HDDA was originally proposed for classification or clustering with Gaussian mixture model. 
Interested readers can find a detailed presentation of HDDA in |21| I22j . 

In subspace models, it is assumed that the data from each class are clustered in the vector space. This 
cluster does not need to have an elliptic shape but it is generally assumed that the data follow a Gaussian 
distribution. The covariance matrix of the class c can be written through its eigenvalue decomposition: 

£ c = Q C A C Q* 

where A c is the diagonal matrix of eigenvalues A c j, i S {1, . . . , d}, of S c and Q c is the matrix that contains 
the corresponding eigenvectors q c j. The HDDA model assumes the p c first eigenvalues are different and the 
remaining d — p c eigenvalues are identical. The model is similar to PPCA, but more general in the sense 
that additional sub-models can be defined. In particular, the intrinsic dimension p c are not constrained in 
HDDA whereas there are assumed to be equal for each class in PPCA. 
Under the HDDA framework, the covariance matrix has the following expression: 

Pc d 
i=l i=p c +l 

where the last d — p c eigenvalue are equal to b c . The inverse can be computed explicitly by 

Pc 1 1 d 

i=l c% c i=p c +l 



^ Ac 

This statistical model can be understood equivalently by a geometrical assumption: For each class, the 
data belong to a cluster that lives in a lower dimensional space A c , namely the signal subspace. The 
original input space can be decomposed as M. d = A c (J) A c (by construction A is the noise subspace which 
contains only white noise). Figure [l] gives an illustration of that in M 3 . 

Using I = Yli=i QciQci' I being the identity matrix, the inverse can be finally written as 

i=i 

Standard likelihood maximization shows that the parameters (A^, q c i)j=i,... » c and b c can be computed 
from the sample covariance matrix |21| : 



— } (Xj - X c ) (Xi - x c ) 



1=1 



where x c is the sample mean for class c and n c the number of samples of the class. A c j is estimated by 
the i first eigenvalue of S c , q c j by the corresponding eigenvector and b c = (trace(S! c ) — Ya=i a «) /{d — Pc) 
(the estimation of the dimension p c of the subspace is discussed later) . The last d — p c eigenvalues and 
their corresponding eigenvectors are not needed for the computation of the inverse in 

defers to {cnjbiQidi] in [2T1I22] . 
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Figure 1: Cluster-based model. The distance between x and z is computed both in the signal subspace 
and in the noise subspace. Note that in this example dim(^4 c ) < dim(^4 c ), but for real data it is usually 
the opposite. || • ||q s is the dot product in A c and || • ||q is the dot product in A c - 

The major advantage of such a model is that it reduces drastically the number of parameters to estimate 
for computing the inverse matrix. Indeed, with the full covariance matrix, d(d + 3) /2 parameters are to 
be estimated. With the HDDA model, only d{p c + 1) + 1 — p c {Pc — l)/2 parameters are to be estimated. 
For instance, if d = 100 and p c = 10, 5150 parameters are needed for the full covariance and only 1056 for 
the HDDA model. Furthermore, the stability is improved since the smallest eigenvalues of the covariance 
matrix and their corresponding eigenvectors, which are difficult to compute accurately, are not used in 
Finally, using the HDDA model, the square Mahalanobis distance for class c is approximated by 

4> z ) = £ (r - i) n^(x - z) ii 2 + | l x ^. ( 3 ) 

This approach relies on the analysis of the empirical covariance matrix, as with PCA. But instead of 
keeping only significant eigenvalues, ([3j considers all the original space, without discarding any dimension. 
This has two main theoretical advantages over the conventional PCA: 

1. Two samples may be close in the signal subspace but far apart in the original space, which is a 
problem for classification tasks. It can be handled by considering the noise subspace together with 
the signal subspace. Consider for instance z, z' and x in Figure In A, z' seems closer to x than 
z, while it is not as it can be seen by adding A in the distance computation. 

2. An accurate estimation of the signal subspace size p c is necessary if PCA is applied: The worst 
scenario being p c « p c , i.e., relevant eigenvectors are discarded. By considering both the signal 
and the noise subspaces, the method becomes less sensitive to p c . Even in the worst case scenario, 
the eigenvectors are still considered. 

2.2 Mahalanobis Kernel 

The regularized Mahalanobis kernel for class c is constructed by substituting ^ to the Euclidean distance 
in the Gaussian kernel (1) and switching eigenvalues (\ c i,b c ) to hyper parameters (cr 2 j, c^+l) ^ na ^ are 
optimized during the training step: 

c) = 

(4) 
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(a) (b) (c) 

Figure 2: Values of the kernel function fc(0,x) with = [0,0] and x G [—1,1] x [—1,1]. The red line 
represents the contour line for the value 0.75. (a) is the Gaussian kernel, (b) is the kernel Q with 
a\ = a 2 = 0.5, (c) is the kernel ^ with a\ = 1.5 and er| = 0.5. The covariance matrix used was 
[0.6 — 0.2; —0.2 0.6] and the signal subspace was of dimension 1, spanned by the first eigenvector of the 
covariance matrix. 

where <j c j, i G {1, . . . , ]5 C + 1} are the hyperparameters of the kernel. As described in Section [3j these 
parameters are tuned during the training step. The hyperparameters have been introduced for the following 
reason. It is known that the principal directions are not optimal for classification since they do not 
maximize any discrimination criterion. However, they still span a subspace where there are variations 
in the data of the considered class. The hyperparameters a c i allow to control which directions are more 
relevant (or discriminative) for the classification process: The feature space is modified during the training 
process to ensure a better discrimination between samples. 

It is interesting to note that the regularized Mahalanobis kernel can be expressed as the product of 
Gaussian kernels: 

Pc 

fc m (x,z|c) = fc s (x,z) x JjA^q^q^z). (5) 

i=l 

The feature space induced by the kernel and the influence of the hyperparameters is analyzed in the next 
section. 



2.3 Geometry of the induced feature space 

Working with a kernel function is equivalent to work with samples mapped onto a feature space T~L, where 
the dot product is equivalent to the kernel evaluation in the input space [231 [29] : 



fc(x,z) = (0(x),0(z)) 



H, 



(j) being the feature map. Under some weak conditions, the projected samples in the feature space live on 
a Riemannian manifold |30[ [3"T] . The metric tensor is 



<9 2 /c(x,z) 



dxidzj 



(6) 



which is, for the Gaussian kernel, 5ij(x) = a~ 2 5ij with 5ij = 1 if i = j and otherwise. This metric 
stretches or compresses the Euclidean distance between x and z by a factor a~ 2 . Each variable is assumed 
equally relevant for the given task, e.g., classification or regression. 
For the kernel M| the metric tensor is: 



=1 °cl a cp c +l 
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with q c n the i th element of q c ^. The distance between two samples is stretched (if a% > 1) or compressed 
(if — -*■) a l° n g the p c first principal components of class c (first term of the right part of the equation) 
and along the original components (last term of the equation). In other words, each principal component 
is weighted according to its relevance for the processing. 

The analysis of the metric tensor exhibits the nature of the proposed kernel for a given class: It is 
a mixture of a Gaussian kernel on the original variables and a Gaussian kernel on the p c first principal 
components of the considered class. The hyper parameters u c / are tuned during the training process. This 
allows the optimization of the weight of each kernel. If a\ = +00, V/ G {1, . . . ,p c }, reduces to the 
conventional Gaussian kernel. On the contrary, if & 2 p c+ i = +°°> (4|) reduces to the Gaussian kernel on the 
p c first principal components. Figure [2] shows the kernel values fordifferent values of the hyper parameters. 
Note that opposite to the Gaussian kernel, the kernel in (Q is not isotropic. 

The following section reviews the basics of SVM classifier and presents how the hyperparameters are 
computed. 



3 L2-SVM and Radius margin bound optimization 

Support vector machines (SVM) is a standard classification kernel methods [32]. It has shown to performs 
very well on several data sets from moderate dimension to high dimensional data |33[ 134], In the following 
section, the main results are presented but interested readers could see references |32^ [35] [36] for further 
mathematical details about the SVM framework. 



3.1 L2-Support Vector Machines 

The L2-SVM is considered in this work rather than the conventional Ll-SVM [ 35] : With L2-SVM it is 
possible to tune the hyperparameters automatically by optimizing the so called radius-margin bound |37] . 
The L2-SVM solves the conventional Ll-SVM optimization problem with a quadratic penalization of 
errors [35] . Given a training set S = {(xi, y\ ),..., (x n , y n )}, ( x i,?/i) G M d x { — 1; 1}, the parameters 
(ai)2 = i and b of the decision function /, 

n 

/( z ) = ^ajfc(xi,z) + b, 
i=i 

are found by solving the convex optimization problem: 



n 1 n 

max g(a) = - - ^ aiajyiyjk^Xj) 

subject to < Q.i and OL%\ji = 



t=l 

where k(Xi,Xj) = k(x.i,Xj) + C~ 1 5ij with k the kernel function and C a positive hyperparameter that is 
used to penalize the training errors. 

An estimate of the generalization errors is given by an upper bound on the number of errors of the 
leave-one-out procedure, the radius-margin bound T [36j : 

T(p) := 1Z 2 M 2 . (9) 

TZ 2 is the radius of the smallest hypersphere that contains all </>(xj), M 2 is the margin of the classifier, 
it is given by the optimal objective function of (|8|), and p are the hyperparameters. In our setting 
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+1 ,C]. TZ 2 is obtained by the optimal objective function of the following constraint 
optimization problem |36j : 



max </(/?) 



subject to < /3{ and 5^ A = 1- 



(10) 



i=l 



Since both 7£ 2 and .M 2 depend on p, it is possible to optimize T to set the hyperparameters. Chapelle el 
al. |37| . followed later by S. S. Keerthi |38j . have proposed an algorithm based on gradient optimization 
method. It is discussed in the following section. 

With the proposed kernel, if two classes i and j are considered, the classifier for "i vs j" is not the same 
as the classifier for "j vs i" since the kernel function is specific to the classes i and j, respectively. Indeed, 
for a multiclass problem, the "one vs one" approach must not be used and the "one vs all" approach should 
be preferred [39J. 



3.2 Radius-margin bound Optimization 

Computing the gradient of T requires the computation of the gradient of the following expression^] 

n n 

M 2 = 2}ai- aiajyiyjk(xi, x 3 -) 
i=i ij=i 



111 



and of 



(12) 



i=i 



where (ai)f =1 and (/3j)™ =1 are the optimal parameters of (|8j) and (10). The gradient of (|ll| depends on 
on, which depends on p (similar comments hold for (12)). Chapelle et al. have proven that since M 2 and 



TZ are computed via an optimization problem, the gradients of b\ and do not enter into account in the 



computation of their gradients |37]. Hence, the gradient of (10) can be written as: 

;yt i)T rrr 

VT 



with 



and 



dC>doV-do 2 c+l 



8T dTZ 2 ka2 ^ 2 dM 2 
-Mr + TZ- 



dC dC 



dC 



daj daj 

for £ £ {I, . . . ,p c + 1}. The derivatives of TZ 2 are 

dTZ 2 



8T dTZ 2 ka2 ^ 2 dM 2 
M 2 +TZ 2 



da 2 



dC 

8TZ 2 
da 2 



C 2 



1 



da 



(13) 

(14) 
(15) 

(16) 
(17) 



2 For simplicity, the parameter c of the kernel function is omitted, i.e., fe(x, z|c) is written as fc(x,z) 
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Figure 3: Scree test of Cattell. The threshold s is set to 10%. The estimated p c is 14. See Section 5.1 for 
a description of the data set. The horizontal axis represents the index i. The vertical axis represents the 
numerical difference between two consecutive eigenvalues, Aj. The curve represents the difference between 
two consecutive eigenvalues and the horizontal line represents 10 % of the highest difference. The arrow 
shows the estimated p c . 



with 



dfc(xj,Xj) 
da 2 



(7 



k(xi,Xj) if i e {1, . . . ,p c } 



(18) 



a , 



^(^j; Xj 



The derivatives of .M 2 are 



dM 2 _ J_A- 
i=i 

dM 2 A dk{xi,xj) 

-- - ^ a iaj y iyj — — — 



da 2 



(19) 
(20) 



Once the derivatives have been computed, the optimization of T is done through a conventional gradient 
descent, following the framework in [37J. At each iteration t, the set of hyper parameters are updated as 
with a step proportional to the negative of the gradient of T: 



V = p - 7VT 

where 7 > is a step size parameter. For implementation details, see [40J. 



4 Estimation of p c 

The size of the signal subspace was estimated by the scree test of Cattell |3T] using the same methodology 
as in [21J. The test consists in comparing the difference, Aj, between two consecutive eigenvalues Aj 
and Aj.fi, Aj = Aj — When the differences Aj are below a user-defined threshold s for all i, i.e., 

Aj < s,Vj £ {i, ... ,d — 1}, p c is estimated as p c = i. In general the threshold is a percentage of the 
highest difference. Figure [3] shows an example on a simulated data set (see Section 5.1 for a description 
of the data). The correct value in that case is p = 10 but its estimate is p = 14. 
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(a) (b) (c) 

Figure 4: Simulated spectra: (a) si, (b) S2 and (c) x. The horizontal axis is the variable and the vertical 
axis is the simulated reflectance. The parameters of the simulation are: a = [0.6,0.4], d = 413, p = 10, 
N c = 2 and SNR=1. 

5 Experimental results 

Classification results are presented in this section. Regarding the multiclass strategy, the results must 
be considered as individual binary classification problems: No fusion rules were applied. For instance, in 
Table |4j the results for the class "Asphalt" should be read as "Asphalt vs all". The reason for the use 
of this approach is the better interpretation of the results which is obtained because the results are not 
biased by the multiclass fusion strategy. 

5.1 Classification of simulated data following HDDA model 

In this section, the proposed kernel, namely the HDDA-Mahalanobis Kernel (HDDA-MK), is used with 
the SVM in classification and evaluated on simulated data. The performances in terms of classification 
accuracy have been compared to a SVM with a conventional Gaussian kernel on the original data and 
on the data projected on the first principal axis of the considered classes, called the PCA-Mahalanobis 
kernel. The main difference between the HDDA-MK and the PCA-Mahalanobis kernel is that the PCA- 
Mahalanobis kernel discards the noise subspace while the HDDA-MK also exploits the noise subspace in 
order to improve the class discrimination. As previously stated, Gaussian kernel and PCA-Mahalanobis 
kernel correspond to extreme cases of HDDA-MK: crj = +00, \/p € {1, . . . ,Pn c } 01 a p+i = 
Simulated data were constructed using a linear mixture model |42j : 

N c 

X = J2 a i s i + h ( 21 ) 
i=\ 

where N c is the number of classes, y = j such as otj = maxjOj, b ~ A/"(0,e 2 I) and Sj follows the HDDA 
model. The mean values of Sj were extracted from a spectral library provided with the ENVI software 
used in hyperspectral imagery [43J. The number of spectral variables d was set to 413 and p c was set to 
10 for each class. The noise variance e 2 was adjusted to get a SNR = 1. Three experiments were run for 
a different number of classes, i.e., 2, 3 and 4 classes, respectively. Figure [4] presents two simulated spectra 
and their linear mixture. The number of training samples was 1000 and the number of testing samples was 
1500. The experiment was repeated 50 times for each configuration. The hyperparameters were estimated 
using the radius margin bound, for each classifier. Since no difference in terms of classification accuracies 
were observed, we only report the results of the "1 vs all" classifier. 

5.1.1 Estimation of p c 

With simulated data, it is possible to assess how the size of the intrinsic signal subspace is estimated. 
Figure [5] presents the boxplots of the estimations. After several tries, the threshold for the scree test was 
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5: Boxplot of the estimation of p c for the three configurations. The vertical axis represents values 
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6: Boxplots of the classification accuracies for the three experiments, (a) N c = 2, (b) iV c 
= 4. The vertical axis represent the overall classification accuracies. 
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fixed to 10%. Fixing the threshold to a too high value would lead to underestimate p while a too low value 
would lead to drastically overestimate it. 

From the figure, the scree test overestimates the parameter p, for each configuration. The variance of 
the estimation is larger when the number of classes is increased while the bias of the estimation decreases. 
However, the error in estimating p is not too important with regards to the original size of the data 
(d = 413). Furthermore, in previous work [44] , another criterion (the BIC [45J) was used to estimate the 
correct dimension of the subspace where the data live. The BIC criterion showed poor results when the 
number of training samples for single class n c was close to the dimension of the data {d ~ n c ). From the 
experiments, the scree test is more robust in such a situation. 



5.1.2 Classification accuracies 

The percentages of correct classification are reported in Figure [6] For the three experiments, the proposed 
kernel leads to the best results in terms of accuracies. Although p c was overestimated, it did not penalize 
the performances of the algorithm in terms of classification accuracies. For N c = 2, the second best result 
is provided by the PCA-Mahalanobis kernel, while for N c = 3 or 4 it is provided by the Gaussian kernel 
applied on the original data. For instance, for N c = 4, the mean value of correct classification is 92.2% 
for the HDDA-Mahalanobis kernel, 91.3% for the conventional Gaussian kernel and only 76.3% for the 
PCA-Mahalanobis. The results confirm the poor generalization capability of the Mahalanobis kernel when 
dealing with high dimensional spaces. Although the conventional Gaussian kernel is less sensitive to the 
problem, the proposed kernel gives a significant improvement of the classification accuracy. 
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Table 2: Madelon data: Percentage of samples correctly classified. 





Class 1 


Class 2 


Mean 


Gaussian 


69.7 


69.7 


69.7 


PCA-Mahalanobis 


83.3 


81.8 


82.5 


HDDA-Mahalanobis 


84.1 


83.8 


83.9 



Table 3: Arcene data: s is the threshold value in the scree test, (fti,P2) correspond to the estimated size 
of the signal subspace for each class. The number is the percentage of samples correctly classified. 





s 


0.2 0.1 0.05 0.01 0.005 0.001 




(pufo) 


(2,2) (2,3) (3,4) (8,7) (11,10) (22,23) 


PCA-Mahalanobis 


Class 1 
Class 2 


69.0 69.0 70.0 75.0 78.0 51.0 
72.0 72.0 62.0 67.0 72.0 75.0 


HDDA-Mahalanobis 


Class 1 
Class 2 


80.0 80.0 80.0 80.0 83.0 51.0 
80.0 80.0 81.0 80.0 81.0 74.0 



5.2 Classification of Madelon data 

Madelon data set is a simulated data set used for the NIPS Feature Selection Challenge^ It has 5 useful 
features, 15 redundant features and 480 random probes for a total of 500 features (d = 500). It is composed 
of two classes and the number of training samples is 2000 and the number of testing sample is 600. The 
threshold for the scree test has been set to 20%. The proposed kernel has been compared to the same 
kernels as in the previous section. 

The classification results in terms of accuracies are reported in Table [2] The conventional Gaussian 
kernel performs badly on that data set with a global precision of 69,7% (random classifier would achieve 
50%). The best accuracy is obtained for the proposed kernel with an average accuracy of 83.9%. The size 
p c of the signal subspace for the two classes was p\ = 4 and p2 = 4, respectively. The results obtained with 
the PCA-Mahalanobis kernel are worse than those obtained with the HDDA-Mahalanobis kernel. Thus, 
it confirms the pertinence to use both the signal and the noise subspace from the HDDA model in the 
classification. 



5.3 Classification of Arcene data set 

Arcene data set is data set used for the NIPS Feature Selection Challenge. It has 7000 real variables, 3000 
random probes for a total of 10000 features [d = 10000). It is composed of two classes. The number of 
available training samples is 100 and the number of test samples is 100. Therefore, the number of training 
samples is very small in comparison with the number of variables. About 50% of the data are non zero. 

The classification accuracies are reported in Table [3] The selected threshold for the scree test has 
been set to 0.5%. The results obtained for other values of the threshold are also reported for comparison. 
The Gaussian kernel achieves a global accuracy of 80%. It performs quite well in terms of classification 
accuracies related to the dimension of the data. The PCA-Mahalanobis kernel performs worst whatever the 
threshold value. For the HDDA-Mahalanobis kernel, for the highest value of s the results are equal to those 
obtain with the Gaussian kernel. Then a slight increased of the accuracy is observed for s = 0.005. When 
the size of the signal subspace is too large (s = 0.0001 and (pi,P2) = (22, 23)), too many hyperparameters 
have to be estimated and thus the classification accuracy becomes low. 

"http : //www.nipsf sc . ecs . soton.ac.uk/ 



12 



Table 4: Classification accuracies for the different kernels in percentage of correctly classified samples. In 
the first column, the numbers in brackets represent the total number of training and testing samples for 
each class, respectively. 



Pc 


Gaussian 


PCA-Mahalanobis 


HDDA-Mahalanobis 


Asphalt (548, 6631) 


12 


94.8 


95.8 


95.8 


Meadow (540, 18649) 


10 


79.4 


83.6 


82.1 


Gravel (392, 2099) 


9 


97.2 


97.5 


97.2 


Tree (524, 3064) 


14 


94.3 


98.2 


98.2 


Metal Sheet (265, 1345) 


7 


99.8 


99.9 


99.9 


Bare Soil (532, 5029) 


9 


87.8 


85.9 


88.4 


Bitumen (375, 1330) 


21 


98.8 


98.7 


99.0 


Brick (514, 3682) 


12 


96.7 


97.2 


97.2 


Shadow (231, 947) 


14 


99.9 


99.9 


99.9 


Average class accuracy 


94.3 


95.2 


95.3 



0.9 



o.8 .^y s . 

0.7 / 
0.6 / 

10 20 30 

Figure 7: Classification accuracies as a function of the dimension of the signal subspace for the class 
meadow. The horizontal axis represents the value p c and the vertical axis represents the classification 
accuracies for the class meadow. Solid line corresponds to the PCA-Mahalanobis kernel and the dashed 
line corresponds to the HDDA-Mahalanobis kernel. The selected value with the scree test is 10. 

5.4 Classification of real hyperspectral data 

The data set considered in this experiment is the University Area of Pavia, Italy, acquired with the ROSIS- 
03 sensor. The image has 103 spectral variables, i.e., each pixel is represented by a vector with 103 features 
(d=103) |46j . Nine classes have been defined by photo-interpretation as seen in first column of Table [IJ 
Here, the threshold was set to 0.01%, because of a very high value of the first principal component (mainly 
due to the albedo). 

Classification results are reported in Table [4] The proposed kernel leads to an increase of the accuracy, 
compared to the conventional Gaussian kernel. However, for this data set, the PCA-Mahalanobis and 
HDDA-Mahalanobis kernel perform equally well in terms of accuracies, except for the classes meadow and 
bare soil. 

To assess the influence of p c on the classification accuracies, the class meadow has been classified for 
a range of values of p c . The results are reported in the Figure [7] for the PCA-Mahalanobis and HDDA- 
Mahalanobis kernels. From the figure, the optimal p c is 11 which is close to the value selected with the 
scree test {p c = 10). The cumulative variance is 99.72% for p c = 10, 99.75% for p c = 11 and 99.77% 
for p c = 12. The proposed kernel is slightly influenced by the choice of p c for that class, compared to 
PCA-Mahalanobis kernel. In particular, the proposed kernel is robust if p c is underestimated. However, 
if p c is heavily overestimated, too many parameters would need to be estimated and it could degrade the 
training process. 
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Table 5: Processing time in seconds of the competive methods for four different data sets. 





Arcene 


Madelon 


Asphalt 


Meadow 


Gaussian 


2.4 


31.9 


113 .0 


87.4 


PCA-Mahalanobis 


14.2 


24.0 


193.0 


185.7 


HDDA-Mahalanobis 


14.5 


80.8 


94.2 


195.3 



5.5 Analysis of the processing time 

To assess the computational load of the proposed method, the processing time was computed for four data 
sets. The two data sets from the NIPS Feature Selection Challenge were used and well as the two first 
classes (Asphalt and Method) of the hyperspectral data set. The results are reported in Table [5] The 
program runs under Matlab on a two cores 2.67GHz laptop. 

For the Arcene data set, the Gaussian kernel shows the lowest computational time. The computation 
of the first eigenvalues/eigenvectors is demanding since the data have 10000 features. It requires about 
14 seconds. The optimization of the hyperparameters is fast since a few number of training samples are 
available. For the Madelon data sets, the computation of the firs eigenvalues/eigenvectors is fast, about 
1.3 seconds while the optimization of the kernel hyperparameters is more demanding. For that data set, 
the HDDA-Mahalanobis is the slowest. Regarding the University data set, for the first two classes, the 
estimation of the first eigenvalues/eigenvectors is very fast, about 0.3 second. However, the estimation 
of the kernel hyperparameters is more demanding than with the two others data sets. For the Asphalt 
problem, the HDDA-Mahalanobis performs the fastest, while it is the slowest for the Meadow problem. 

From the above results, it is difficult to point out a clear winner in terms of processing time. From 
a computational viewpoint, the optimization of the Gaussian kernel is less demanding than PCA- or 
HDDA-Mahalanobis kernels. However, since the optimization problem is a gradient descent on a non- 
convex function, a local minimum might be found sooner with one method and make it faster than the 
others. Nevertheless, the HDDA-Mahalanobis is usually more demanding in terms of processing time. 
Computational complexity of PCA- and HDDA-Mahalanobis can be assumed to be comparable in terms 
of processing time. 

6 Conclusions 

In this paper, a novel kernel adapted to high dimensional data has been proposed. The parsimonious 
Mahalanobis kernel is based on the emptiness property of HD spaces. For each class, the original input 
space is split into a signal subspace and a noise subspace. Using this assumption, the inversion of the 
covariance matrix in the Mahalanobis kernel can be accurately computed. The proposed kernel was tested 
in a SVM framework for the purpose of classification. Experimental results on four data sets demonstrate 
the potential of the proposed kernel. In each case, the classification accuracy increased compared to the 
conventional Gaussian kernel and for three cases the proposed kernel showed superior results to simply 
map the data on the first PCA axes. Consequently, for HD data the HDDA-Mahalanobis kernel should 
be prefered. 

Regarding the computational load, the HDDA-Mahalanobis kernel is more demanding during the 
training process than the Gaussian kernel since more hyperparameters have to be estimated. Besides 
that, the HDDA-Mahalanobis kernel is efficient when the dimension of the input space is high. Hence, for 
moderate or small dimensions, conventional kernels should be preferred. 

In the article, only classification is investigated. But other processing could also have been considered, 
e.g., regression |27] , The critical point is to be able to tune the hyperparameter efficiently, which it is 
feasible for regression. However, the actual optimization of the hyperparameters is demanding in terms of 
computations and it is sensitive to local minima. Therefore, a different strategy must be studied. 

Perspectives of this work concern the development of new kernels using the HDDA-model. For instance, 
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it is possible to define a new dot product for the conventional polynomial kernel: 

k(x, z|c) = (x'S^z + l) r . (22) 

Furthermore, a mixture of kernels using the HDDA could be used. From the HDDA-Mahalanobis 
kernel can be considered as a product of several kernels. In the future, we will investigate the summation 
of kernels. 

Finally, a free-parameter alternative to the scree test for estimation of the intrinsic dimension must be 
addressed. For instance, an maximum likelihood estimator for HDDA exits and must be investigated |47| . 
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